Introduction

This is the main analysis for my project, in which I aim to explain what makes a song popular on Spotify. This document covers the audio features used as variables, hypothesis tests on those features, exploratory analysis, a linear regression model build, a logistic regression model build, a genre analysis, and my conclusions.


The Variables - Audio Features

Here are the audio features we will mostly be focusing on. How do these help explain the popularity of a song?

Acousticness - detects the presence of acoustic instruments

Danceability - based on rhythm stability and beat strength

Energy - measure of intensity and activity

Instrumentalness - the higher the score, the fewer vocals the track contains

Liveness - detects the presence of an audience or if the track was recorded live

Loudness - how loud the track is

Speechiness - detects the presence of spoken word, giving rap music a higher score than opera

Valence - how positive the track is; the higher the score, the happier the track generally feels


Hypothesis Tests on Audio Features

This is an example of the difference-in-means hypothesis testing carried out on each audio feature. Please see hyp_testing.Rmd for the full hypothesis testing.

Two-sample independent tests

H0: The mean danceability in the 1960s is the same as the mean danceability in the 2010s

Ha: The mean danceability in the 1960s is less than the mean danceability in the 2010s

Equivalently: H0: danceability_2010s - danceability_1960s = 0; Ha: danceability_2010s - danceability_1960s > 0
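A one-sided two-sample t-test of this kind could be run as sketched below. The `tracks` data frame here is a small synthetic stand-in for the real data set, not the actual object used in hyp_testing.Rmd.

```r
# Illustrative one-sided two-sample t-test matching the hypotheses above.
# `tracks` is synthetic: the means and sds are made up for the example.
set.seed(42)
tracks <- data.frame(
  decade       = rep(c("1960s", "2010s"), each = 100),
  danceability = c(rnorm(100, mean = 48, sd = 10),   # made-up 1960s scores
                   rnorm(100, mean = 58, sd = 10))   # made-up 2010s scores
)

# alternative = "less" tests Ha: mean(1960s) < mean(2010s)
result <- t.test(danceability ~ decade, data = tracks, alternative = "less")
result$p.value  # a small p-value means we reject H0
```

With the formula interface, "less" refers to the first factor level ("1960s") having the smaller mean, which matches Ha above.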

Exploratory Analysis



Linear Regression Model Build

  • To help me answer my question of what makes a song “popular” on Spotify, I decided to build an explanatory linear regression model. This type of analysis is used to determine the strength of the relationship between a response variable and multiple explanatory variables.

  • popularity = b0 + b1x1 + b2x2 + b3x3 + … + bnxn

  • While building this model I used an 80-20 train-test split: I worked on 80% of the data, then tested my outcome on the remaining 20%.

  • To start this process I plotted popularity against each of my possible explanatory variables and looked for the strongest correlation.

  • For full linear model build see linear_model_build.Rmd

  • My strongest correlation was with year (the year the track was released), at 0.74. I then added this to my model as my first explanatory variable.
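The split and correlation screen described above might look like the following sketch. The `spotify` data frame is a synthetic stand-in; the real `train_lm` in this document comes from the actual data set.

```r
# Sketch of an 80-20 train-test split and a correlation check.
# `spotify` here is synthetic data purely for illustration.
set.seed(42)
spotify <- data.frame(popularity = runif(1000, 0, 100),
                      year = sample(1960:2020, 1000, replace = TRUE))

n <- nrow(spotify)
train_index <- sample(seq_len(n), size = floor(0.8 * n))
train_lm <- spotify[train_index, ]    # 80% for model building
test_lm  <- spotify[-train_index, ]   # 20% held back for testing

# Screen a candidate explanatory variable by its correlation with popularity
cor(train_lm$popularity, train_lm$year)
```
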
model_1a <- lm(popularity ~ year,
               data = train_lm)

summary(model_1a)
## 
## Call:
## lm(formula = popularity ~ year, data = train_lm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -61.158  -7.444  -1.631   5.832  54.369 
## 
## Coefficients:
##                 Estimate   Std. Error t value            Pr(>|t|)    
## (Intercept) -1223.657496     3.764367  -325.1 <0.0000000000000002 ***
## year            0.636047     0.001892   336.2 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.36 on 97323 degrees of freedom
## Multiple R-squared:  0.5374, Adjusted R-squared:  0.5374 
## F-statistic: 1.131e+05 on 1 and 97323 DF,  p-value: < 0.00000000000000022

After running my model I look at three factors:

  • The p-value - is this variable making a significant difference? If the p-value is below the significance level of 0.05 then we can reject the null hypothesis and conclude that the relationship between the variables is significant

  • The R^2 - a measure that indicates how much of the variation in popularity is explained by year

  • The adjusted R^2 - compensates for the addition of variables. As we're building an explanatory model, we don't want this to drop much below the R^2.
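The three quantities above can be pulled straight out of a fitted model object. The model below is fit on synthetic data purely to show where they live in the `summary()` output.

```r
# Extracting p-values, R-squared and adjusted R-squared from an lm fit.
# The data here are synthetic, for illustration only.
set.seed(42)
d <- data.frame(x = rnorm(200))
d$y <- 2 * d$x + rnorm(200)

fit <- lm(y ~ x, data = d)
s <- summary(fit)

s$coefficients[, "Pr(>|t|)"]  # p-value for each term
s$r.squared                   # multiple R-squared
s$adj.r.squared               # adjusted R-squared
```
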


ANOVA Test

I used ANOVA (Analysis of Variance) tests to check that the improvement from each new model over the previous one was significant.

anova(model_3a, model_2c)
  • With a p-value < 0.001 we can reject the null hypothesis in favour of the alternative, meaning the new model (model_3a) is a statistically significant improvement
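A nested-model comparison like this can be sketched on synthetic data. The two models below stand in for a pair like model_2c and model_3a; the null hypothesis is that the extra term adds no explanatory power.

```r
# ANOVA comparison of nested linear models, on synthetic data.
set.seed(42)
d <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
d$y <- d$x1 + 0.5 * d$x2 + rnorm(300)

smaller <- lm(y ~ x1, data = d)        # analogous to the previous model
larger  <- lm(y ~ x1 + x2, data = d)   # analogous to the new model

# F-test: does adding x2 significantly reduce the residual sum of squares?
comparison <- anova(smaller, larger)
comparison
```
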


Final Linear Regression Model


popularity ~ year + danceability + loudness + liveness + explicit (liveness carries a negative coefficient)

model_6a <- lm(popularity ~ year + danceability + loudness + liveness + explicit,
               data = train_lm)

summary(model_6a)
## 
## Call:
## lm(formula = popularity ~ year + danceability + loudness + liveness + 
##     explicit, data = train_lm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -62.595  -7.392  -1.589   5.839  54.268 
## 
## Coefficients:
##                  Estimate   Std. Error  t value            Pr(>|t|)    
## (Intercept)  -1173.381097     4.170239 -281.370 <0.0000000000000002 ***
## year             0.607019     0.002155  281.715 <0.0000000000000002 ***
## danceability     0.028390     0.002044   13.886 <0.0000000000000002 ***
## loudness         0.085082     0.004765   17.855 <0.0000000000000002 ***
## liveness        -0.036995     0.001859  -19.903 <0.0000000000000002 ***
## explicit         1.064796     0.118767    8.965 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.3 on 97319 degrees of freedom
## Multiple R-squared:  0.5431, Adjusted R-squared:  0.543 
## F-statistic: 2.313e+04 on 5 and 97319 DF,  p-value: < 0.00000000000000022
  • So here we have the final model. I stopped adding variables once the adjusted R^2 started dropping and the multiple R^2 was barely going up.

  • All of our p-values are significant and we have a multiple R^2 of 0.54, with an adjusted R^2 also of 0.54. This means that 54% of the variance in popularity is explained by our explanatory variables.

  • This suggests that the model has moderate explanatory power, as about half of the variance in popularity is accounted for by the regression. However, it also tells us that a large portion of the variability remains unexplained and might be attributed to factors not included in the model.

Logistic Regression Model Build

  • Logistic regression is a statistical analysis method used to predict, or explain, a binary outcome, such as yes or no, based on prior observations of a data set

  • Rather than using the popularity score I used a variable I created called is_popular. This splits the data into a logical type: TRUE if the song has a popularity score of 50 or above, and FALSE if below 50

  • This was built in a similar way to the linear model: I look for correlations and add them to my model one at a time, checking each is significant. The main difference is that this time I'm looking for a high AUC score, rather than the multiple R^2 I was looking for in the linear model

  • For full logistic model build see logistic_model_build.Rmd

  • I decided to no longer include the year or decade the song was released for this model. Year had such a large influence on the linear model that I thought it would be more interesting to see how the logistic model fared without it. Also, if we're building a model to assist with writing a current-day “popular” song, then variables such as year and decade are of no help
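The is_popular flag described above amounts to a simple threshold on the popularity score, as in this sketch on a tiny synthetic stand-in:

```r
# Creating a logical is_popular column from a popularity score.
# The four scores here are made up for illustration.
spotify <- data.frame(popularity = c(12, 50, 73, 49))

# TRUE for a popularity score of 50 or above, FALSE otherwise
spotify$is_popular <- spotify$popularity >= 50
spotify$is_popular  # FALSE TRUE TRUE FALSE
```
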

Final Logistic Regression Model


is_popular ~ loudness + explicit + danceability + no_of_artists

model_4_final <- glm(is_popular ~ loudness + explicit + danceability + no_of_artists,
             family = "binomial",
             data = train_log_mod)

summary(model_4_final)
## 
## Call:
## glm(formula = is_popular ~ loudness + explicit + danceability + 
##     no_of_artists, family = "binomial", data = train_log_mod)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3163  -0.8544  -0.6608   1.1311   3.7310  
## 
## Coefficients:
##                 Estimate Std. Error z value            Pr(>|z|)    
## (Intercept)   -7.8298932  0.0998028  -78.45 <0.0000000000000002 ***
## loudness       0.0787007  0.0012278   64.10 <0.0000000000000002 ***
## explicit       1.0275451  0.0238377   43.11 <0.0000000000000002 ***
## danceability   0.0098224  0.0004547   21.60 <0.0000000000000002 ***
## no_of_artists  0.1895800  0.0115352   16.43 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 121019  on 97324  degrees of freedom
## Residual deviance: 110098  on 97320  degrees of freedom
## AIC: 110108
## 
## Number of Fisher Scoring iterations: 4
## Area under the curve: 0.7219

  • We have a slightly different kind of result this time, with an improved model
  • The ROC curve shows the performance of my binary classification model (model_4_final). It illustrates the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity) at different classification thresholds
  • A perfect model would have an AUC score of 1 and a random classifier would have an AUC score of 0.5
  • Here we have an AUC score of 0.72, which indicates that the model performs much better than random chance, but still has room for improvement.
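For reference, the AUC behind a ROC analysis like this can be computed in base R via the Mann-Whitney rank statistic (packages such as pROC give the same number). The data below are synthetic, for illustration only.

```r
# AUC of a logistic model via the Mann-Whitney statistic, in base R.
# Synthetic data: the chance of being popular rises with x.
set.seed(42)
d <- data.frame(x = rnorm(500))
d$is_popular <- runif(500) < plogis(d$x)

fit   <- glm(is_popular ~ x, family = "binomial", data = d)
probs <- predict(fit, type = "response")

r  <- rank(probs)
n1 <- sum(d$is_popular)    # number of popular tracks
n0 <- sum(!d$is_popular)   # number of unpopular tracks
auc <- (sum(r[d$is_popular]) - n1 * (n1 + 1) / 2) / (n1 * n0)
auc  # 0.5 = random classifier, 1 = perfect
```
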

  • I thought it would be interesting to see how my final model, built using all of the data from 1960-2020, performed on individual decades. So I split the data in two and ran my model on the 60s, 70s and 80s, and then again on the 90s, 00s and 10s
  • The model performed rather poorly on the older decades, with an AUC score of 0.6, which tells us it is only slightly better than a random classifier
  • It performed slightly better on the more recent decades, with an AUC score of 0.62
  • I believe this shows that even though we have removed the year from the model, the model is still weighted towards newer music

Genre Analysis

  • Here is a Genre analysis showing the proportional change of the Top 5 Genres from 1960 - 2020

  • I kept the Genres and Number of Followers out of my model building as I was missing almost 50% of the data

  • Here we see a massive decline in the proportion of Folk and Soul songs from the 1960s and 70s to the 2010s

  • A rise in Rock music from the 60s, with a spike in the 1980s, then a slow decline to the 2010s

  • A gradual rise in Rap and Pop over the decades, with Pop just overtaking Rap. Combined, they make up around 40% of the songs released in the 2010s

Conclusion

  • The best model is my logistic model. It gave me a reasonable AUC score of 0.72 and I feel it works really well as an explanatory model

  • From my Genre analysis I discovered that the most common genres of the last decade are Pop & Rap

  • There is no escaping how heavily weighted the popularity score is towards new music. I plan to recreate this project using only new music, which I feel will give me better insight into what gives a song a high popularity score.